library(skimr)
library(tidyverse)
library(caret) # For featureplot, classification report
library(corrplot) # For correlation matrix and PCA contributionplots
library(AppliedPredictiveModeling)
library(mice) # For data imputation
library(VIM) # For missing data visualization
library(gridExtra) # For grid plots
library(dendextend) # For dendrograms
library(factoextra) # For PCA plotsThis dataset is composed of real patient responses to two questionnaires related to ADHD and Mood Disorder and a variety of demographic, abuse and drug use variarables. For each questionnaire, the responses to individual questions are provided along with total scores. Links to the actual questions are provided below:
This work will make use of unsupervised learning techniques such as Principal Component Analysis (PCA) and clustering in an attempt to discover underlying structures in the data.
The dataset is composed of 54 variables and 175 observations. The data is coded as numeric and holds 33 observations that have some level of missing data. A summary of the variable distributions is provided below:
adhd_raw <- read.csv('https://raw.githubusercontent.com/maelillien/cuny-projects/main/unsupervised-pca-clustering/adhd_data.csv', header = TRUE)
adhd <- adhd_raw
head(adhd)| Name | adhd %>% select(-c(ADHD.Q… |
| Number of rows | 175 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 20 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Initial | 0 | 1 | FALSE | 109 | DB: 5, CM: 4, DJ: 4, JM: 4 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | ▆▅▇▅▁ |
| Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | ▇▁▁▁▆ |
| Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | ▇▁▁▁▁ |
| ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | ▃▆▇▆▂ |
| MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | ▃▃▆▇▇ |
| Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | ▇▂▁▁▆ |
| THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | ▇▁▁▁▃ |
| Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | ▇▁▁▁▅ |
| Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | ▁▅▇▂▁ |
| Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▃ |
| Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | ▇▂▁▁▁ |
| Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | ▇▁▃▁▁ |
| Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | ▆▇▁▅▂ |
| Psych.meds. | 118 | 0.33 | 0.96 | 0.80 | 0 | 0.0 | 1 | 2.0 | 2 | ▇▁▇▁▆ |
The dataset is modified to include an EducationLevel categorical variable derived from the numerical Education variables representing the years of schooling. The Abuse column is unfolded into 3 binary variables indicating the occurence of the 3 types of abuse. The original Abuse variable is dropped.
We work with a multiple subsets of the data for subsequent parts this report. Some analyses make use of the entire set of questionnaire reponses while others use only the total score.
# Renaming variables
adhd <- rename(adhd, ADHDTotal = ADHD.Total, MDTotal = MD.TOTAL, Sedatives = Sedative.hypnotics, CourtOrder = Court.order,
Violence = Hx.of.Violence, Conduct = Disorderly.Conduct, NonSubstDX = Non.subst.Dx, SubstDX = Subst.Dx, PsychMeds = Psych.meds.)
# Drop Initial column
adhd <- adhd %>% select(-c(Initial))
# Shift sex variable to 0 and 1
adhd$Sex <- adhd$Sex-1
# Re-coding Education Variable, College = Education > 12, HS = Education > 8 & <= 12, MS = Education <= 8
adhd$EducationLevel <- ifelse(adhd$Education <= 8, 1, ifelse(adhd$Education <= 12 & adhd$Education > 8, 2, ifelse(adhd$Education > 12, 3, 99)))
# Creating new variables based on type of abuse
adhd$AbuseP <- as.numeric(adhd$Abuse == 1 | adhd$Abuse == 4 | adhd$Abuse == 5 | adhd$Abuse == 7)
adhd$AbuseS <- as.numeric(adhd$Abuse == 2 | adhd$Abuse == 4 | adhd$Abuse == 6 | adhd$Abuse == 7)
adhd$AbuseE <- as.numeric(adhd$Abuse == 3 | adhd$Abuse == 5 | adhd$Abuse == 6 | adhd$Abuse == 7)
adhd <- adhd %>% select(-c(Abuse))
# Forming data subsets: full set of variables or reduced (totals only)
adhd.full <- adhd
adhd.red <- adhd %>% select(c(Age, Sex, Race, ADHDTotal, MDTotal, Alcohol:AbuseE))
# Set all characters to numeric
adhd.red <- mutate_all(adhd.red, function(x) as.numeric(as.character(x)))
adhd.full <- mutate_all(adhd.full, function(x) as.numeric(as.character(x)))The dataset contains a few missing values. The PsychMeds variable mostly contained missing values and was dropped entirely. A few observations were quite sparse and only contained basic demographic and questionnaire score columns. In order to avoid biasing the dataset with imputed values, we preferred to drop all observations with missing values from the dataset. The resulting dataset contains 33 fewer observations with 142 complete rows and 19 columns.
##
## Variables sorted by number of missings:
## Variable Count
## PsychMeds 0.67428571
## SubstDX 0.13142857
## NonSubstDX 0.12571429
## AbuseP 0.08000000
## AbuseS 0.08000000
## AbuseE 0.08000000
## Suicide 0.07428571
## Violence 0.06285714
## Conduct 0.06285714
## Education 0.05142857
## EducationLevel 0.05142857
## CourtOrder 0.02857143
## Alcohol 0.02285714
## THC 0.02285714
## Cocaine 0.02285714
## Stimulants 0.02285714
## Sedatives 0.02285714
## Opioids 0.02285714
## Age 0.00000000
## Sex 0.00000000
## Race 0.00000000
## ADHDTotal 0.00000000
## MDTotal 0.00000000
# Discard PsychMeds
adhd.red.complete <- adhd.red %>% select(-c(PsychMeds))
adhd.full.complete <- adhd.full %>% select(-c(PsychMeds))
# Keep only complete cases
adhd.red.complete <- adhd.red.complete[complete.cases(adhd.red.complete),]
adhd.full.complete <- adhd.full.complete[complete.cases(adhd.full.complete),]The plots below explore the distributions of subsets of the data.
The various distribution of the responses to the ADHD questionnaire are displayed below. Some variables are unimodal, others bimodal, some are centered and others skewed. Reference should be made to the survey questions in order to gain deeper insight of these distributions.
For the Mood Disorder questionnaire, the majority of the binary responses were 1. The classes are fairly balance expect for a few questions. Q2 is particularly unbalanced. The distribution of the responses to Q3 is highly skewed to the left tail.
A few observations on the binary and ordinal variables:
Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a dataset. We seek to partition observations into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. The most popular clustering approaches are K-means and Hierarchical Clustering (HC). While the former requires a pre-specified number of clusters k, the latter does not. HC is a bottom-up or agglomerative clustering approach which results in an upside-down tree representation, built from the leaves and combined into clusters up to the trunk. Clusters are identified by horizontal cuts across the dendrogram.
In this section, we explore the use of Hierarchical Clustering on two portions of the data. The first uses only the questionnaire responses to ADHD while the second uses the total questionnaire scores for both surveys as well as the other variables (demographic, drugs, abuse, etc). The latter is referred to as the ‘reduced’ dataset.
Clustering typically requires the variables to be scaled in order to avoid more weight to variables using a larger range of values. However, when all the variables under conideration are measured on the same scale, which is the case when only comparing survey responses, it can be appropriate to leave the variables unscaled.
With HC, the concept of dissimilarity between a pair of observations needs to be extended to a pair of groups of observations. This extension is achieved with the notion of linkage, which defines the dissimilarity between two groups of observations. The resulting dendrogram heavily depends on the choice of linkage. The most popular linkages are complete and average because they tend to result in more balanced clusters.
Using only the individual unscaled responses to the ADHD Questionnaire, we obtain the following dendrogram structure using complete linkage. In this case, complete linkage provided the best balancing and a cutoff into 3 clusters looked appropriate. In order to gain insight into these clusters, we need to look at the distribution of the variables within each of them.
For these 3 clusters, we can make the following observations:
We can establish a ranking for this clustering based on the monotic rise in the meanADHD and meanMD across clusters. From less to most severe: Cluster 3, Cluster 2, Cluster 1. With this information, a treatment plan could be developped to focus on clusters 2 and 1, with the latter requiring special attention.
hc0 <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% dist(method = "euclidean") %>% hclust(method = "complete")
dend0 <- hc0 %>% as.dendrogram
sub_grp0 <- cutree(dend0, k = 3, order_clusters_as_data = TRUE)
dend0 %>% set("branches_k_color", k = 3) %>% set("labels", "") %>% plot(main = "Hierarchical Clustering ADHD Questionnaire")adhd.full %>%
mutate(cluster = sub_grp0) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('ADHD Questionnaire Cluster Distribution') +
theme_minimal()For a contrasting analysis, we looked at clustering based on the dataset which included only the total questionnaire scores and dropped observations with missing values. The variables were scaled to balance out the contribution of the high values for scores and age. We explore the use of complete, average and Ward linkages.
Unlike the previous clustering, adding the other variables results in a less balanced dendrogram. In this case, most clusters contain only a few observations and two large clusters contain the majority of the data. We can make a few additional observations:
Based on the variable distributions, it might be tempting to want to group clusters 4 and 5 together since they represent a cohort with similar meanAge, meanMD and meanADHD scores. However, hierachichal clustering considers more variables than the aforementioned few and this kind of grouping would violate the measure of similarity as determined by the y axis of the dendrogram.
A proposed ADHD and Mood Disorder ranking in 4 levels from least severe to most severe is: Cluster 2, Clusters 1, 4 & 5, Clusters 3 (older) & 6 (younger).
hc1 <- adhd.red.complete %>% scale %>% dist(method = "euclidean") %>% hclust(method = "complete")
dend1 <- hc1 %>% as.dendrogram
sub_grp1 <- cutree(dend1, k = 6)
dend1 %>% set("branches_k_color", k = 6) %>% set("labels", "") %>% plot(main = "Hierachical Clustering Reduced Dataset + Complete Linkage")adhd.red.complete %>%
mutate(cluster = sub_grp1) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('Reduced Dataset + Complete Linkage Cluster Distribution') +
theme_minimal()Average linkage completely changes the dendrogram representation. The number of clusters, k=5 seemed appropriate. Similarly to complete linkage, one large cluster is identified, this time containing the vast majority of the observations. This obscures the analysis beyond what we can establish about the small clusters.
hc2 <- adhd.red.complete %>% scale %>% dist(method = "euclidean") %>% hclust(method = 'average')
dend2 <- hc2 %>% as.dendrogram
sub_grp2 <- cutree(dend2, k = 5)
dend2 %>% set("branches_k_color", k = 5) %>% set("labels", "") %>% plot(main = "Hierachical Clustering Reduced Dataset + Average Linkage")adhd.red.complete %>%
mutate(cluster = sub_grp2) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('Reduced Dataset + Average Linkage Cluster Distribution') +
theme_minimal()While average linkage is a popular option, the resulting clustering above was somewhat unattractive. Here we consider another linkage method to expand the analysis. Ward linkage makes use of Ward’s minimum variance criterion which seeks to minimize the total within-cluster variance. The resulting dendrogram is visually attractive and more balanced than before. A cutoff at k=3 clusters seems appropriate. This grouping is more similar to the one that relied only on the ADHD questionnaire responses. Here, Cluster 3 is the oldest group but also has the lowest meanADHD and meanMD scores. Clusters 1 and 2 are less obvious to differentiate since Cluster 2 has the highest meanADHD score but a lower meanMD score than Cluster 1.
hc3 <- adhd.red.complete %>% scale %>% dist(method = "euclidean") %>% hclust(method = 'ward.D')
dend3 <- hc3 %>% as.dendrogram
sub_grp3 <- cutree(dend3, k = 3)
dend3 %>% set("branches_k_color", k = 3) %>% set("labels", "") %>% plot(main = "Hierachical Clustering Reduced Dataset + Ward Linkage")adhd.red.complete %>%
mutate(cluster = sub_grp3) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('Reduced Dataset + Ward Linkage Cluster Distribution') +
theme_minimal()We use a tanglegram to obtain side-by-side comparisons on the clusters obtained using the different linkage methods. Here we only compare the complete and average linkage clusterings which are similar in terms of cluster imbalance. The first thing to notice is the consistency of groupings at the higher ends of the hiearchies. This is shown by the nearly horizontal ribbon linking the two clustering and indicates that the selected observations end up in the corresponsing cluster representation. Naturally, a large number of observations find correspondence in the opposing largest clusters.
dl <- dendlist(
dend1 %>%
set("labels_col", k=6) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 6),
dend2 %>%
set("labels_col", k=5) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 5)
)
# Plot them together
tanglegram(dl,
common_subtrees_color_lines = TRUE, highlight_distinct_edges = TRUE, highlight_branches_lwd=FALSE,
margin_inner=7,
lwd=2,
show_labels = FALSE
)
title("Complete Linkage vs Average")The comparison of complete linkage against ward linkage is interesting since the latter contains only half the number of clusters. It might be worth studying how the patients in the middle cluster of the ward representation (lowest ADHD and MD) find correspondence in the complete linkage clusters. Similarly, we could study in more detail how individuals in Clusters 3 and 6 (using complete linkage) which were small but extreme clusters end up grouped in the ward representation which had less defined levels of severity in terms of ADHD and MD.
dl <- dendlist(
dend1 %>%
set("labels_col", k=6) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 6),
dend3 %>%
set("labels_col", k=3) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 3)
)
# Plot them together
tanglegram(dl,
common_subtrees_color_lines = TRUE, highlight_distinct_edges = TRUE, highlight_branches_lwd=FALSE,
margin_inner=7,
lwd=2,
show_labels = FALSE
)
title("Complete Linkage vs Ward")Principal Comnponent Analysis (PCA) is a dimensionality reduction technique where a dataset is transformed to use p eigenvectors of the covariance matrix instead of the original number of predictors n, where p < n. The number of eigenvectors p is selected by looking at the sorted eigenvalues and determining a threshold percentage of variance explained and the resulting p.
The method seeks to project the data into a lower dimensional space where each axis (or principal component) captures the most variability in the data subject to the condition of being uncorrelated to the other axes. This last condition is important for dimensionality reduction in the sense that large datasets can contain many correlated variables which hold no additional information.
An eigenvalue > 1 indicates that PCs account for more variance than accounted by one of the original variables in standardized data. This is commonly used as a cutoff point for which PCs are retained. This holds true only when the data are standardized. We can also limit the number of component to that number that accounts for a certain fraction of the total variance, for example 70%.
This section focuses on the subset of the data containing only the individual responses to the ADHD questionnaire. The table below displays the first 10 eigenvalues obtained from the decomposition. Following from the cutoff decription above, our selection of dimensions can be based on the number of scaled eigenvalues that are greater than 1 or on a certain percentage of cummulative variance explained. Another way to select the number of PCs to consider is to study the scree plot provided below, which is simply a visual representation of the variance explained by each component. We typically look for an elbow in the plot to make our selection. In our case, the scree plot elbow occurs at Dimensions = 2 and the eigenvalue threshold at Dimensions = 3 which account for 59.2% and 64.8% respectively.
We can study the individual contributions of each questionnaire response to the principal components using the plot below. We observe that the contributions to the first principal component (Dim. 1) by the variables is roughly equivalent. However, Q16 bears a lot of weight in the 2nd dimension as Q5 does in the third dimension.
res.pca.adhd <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% prcomp(scale = FALSE)
corrplot(get_pca_var(res.pca.adhd)$contrib, is.corr=FALSE) The cummulative contribution across the first 3 dimensions is summarized below. Q5 bears the most overall importance, followed by Q16 and Q4. It is worth diving into the actual questionnaire to look up what each question is asking to see if any insights can be drawn to explain the variability. The questions are listed below.
adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>%
prcomp(scale = FALSE) %>%
fviz_contrib(choice = "var", axes = 1:3, top = 10)We can visualize these contribution in 2 dimensions using the first two PCs as shown below. On this plot, we look at the groupings and directions of the vectors. Positively correlated variables are grouped together. Negatively correlated variables are positioned on opposite sides of the plot origin. The distance between variables and the origin measures the contribution of the variables. We make the following observations:
fviz_pca_var(res.pca.adhd,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)Using the resduced dataset, we can use PCA to extract additional information about the patients. Since we are using total scores for this section, the variables must be scaled in order to avoid any overweighted contributions. We find that it takes 9 dimensions to explain approximately 70% of the variance. The scree plot has no obvious elbow which can be used to determine a cutoff. Using the eigenvalue cutoff at Dimension 8, we can still capture 66% of the variance.
From the correlation and variable contribution plots below, we can make a few observations:
res.pca <- adhd.red.complete %>% prcomp(scale = TRUE)
corrplot(get_pca_var(res.pca)$contrib, is.corr=FALSE) Cummulatively for PC1 to PC8, we can observe from the below that Education and its derivative EducationLevel is the largest individual contributor to the principal components, followed by SubstDX, Age and Race. The two ADHD and Mood Disorder questionnaires contribute roughly equivalently to the first 8 dimensions.
The 2D representation of the variable contributions to PC1 and PC2 are shown below. We note the following:
In summary, the most insightful observations from this PCA analysis might be that Suicide and Abuse are generally correlated with both ADHD and Mood Disorder. Additionally, Mood Disorder is more closely correlated with substance-related abuse as well as abuse of Alcohol and THC, Violence and Conduct than ADHD.
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)The biplot representation below displays the same information as above but overlays the individual patient numbers. This view allows the selection of individual observations for a deeper dive.
fviz_pca_biplot(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)